676 research outputs found

    Multimedia big data computing for in-depth event analysis

    Get PDF
    While the most part of ”big data” systems target text-based analytics, multimedia data, which makes up about 2/3 of internet traffic, provide unprecedented opportunities for understanding and responding to real world situations and challenges. Multimedia Big Data Computing is the new topic that focus on all aspects of distributed computing systems that enable massive scale image and video analytics. During the course of this paper we describe BPEM (Big Picture Event Monitor), a Multimedia Big Data Computing framework that operates over streams of digital photos generated by online communities, and enables monitoring the relationship between real world events and social media user reaction in real-time. As a case example, the paper examines publicly available social media data that relate to the Mobile World Congress 2014 that has been harvested and analyzed using the described system.Peer ReviewedPostprint (author's final draft

    Access to vectors in multi-module memories

    Get PDF
    The poor bandwidth obtained from memory when conflicts arise in the modules or in the interconnection network degrades the performance of computers. Address transformation schemes, such as interleaving, skewing and linear transformations, have been proposed to achieve conflict-free access for streams with constant stride. However, this is achieved only for some strides. In this paper, we summarize a mechanism to request the elements in an out-of-order way which allows to achieve conflict-free access for a larger number of strides. We study the cases of a single vector processor and of a vector multiprocessor system. For this latter case, we propose a synchronous mode of accessing memory that can be applied in SIMD machines or in MIMD systems with decoupled access and execution.Peer ReviewedPostprint (published version

    Analysis of the overheads incurred due to speculation in a task based programming model

    Get PDF
    In order to efficiently utilize the ever increasing processing power of multi-cores, a programmer must extract as much parallelism as possible from a given application. However with every such attempt there is an associated overhead of its implementation. A parallelization technique is beneficial only if its respective overhead is less than the performance gains realized. In this paper we analyze the overhead of one such endeavor where, in SMPSs, speculation is used to execute tasks ahead in time. Speculation is used to overcome the synchronization pragmas in SMPSs which block the generation of work and lead to the underutilization of the available resources. TinySTM, a Software Transactional Memory library is used to maintain correctness in case of mis-speculation. In this paper, we analyze the affect of TinySTM on a set of SMPSs applications which employ speculation to improve the performance. We show that for the chosen set of benchmarks, no performance gains are achieved if the application spends more than 1% of its execution time in TinySTM.Peer ReviewedPostprint (published version

    An approach to task-based parallel programming for undergraduate students

    Get PDF
    This paper presents the description of a compulsory parallel programming course in the bachelor degree in Informatics Engineering at the Barcelona School of Informatics, Universitat Politècnica de Catalunya UPC-BarcelonaTech. The main focus of the course is on the shared-memory programming paradigm, which facilitates the presentation of fundamental aspects and notions of parallel computing. Unlike the “traditional” loop-based approach, which is the focus of parallel programming courses in other universities, this course presents the parallel programming concepts using a task-based approach. Tasking allows students to explore a broader set of parallel decomposition strategies, including linear, iterative and recursive strategies, and their implementation using the current version of OpenMP (OpenMP 4.5), which offers mechanisms (pragmas and intrinsic functions) to easily map these strategies into parallel programs. Simple models to understand the benefits of a task decomposition and the trade-offs introduced by different kinds of overheads are included in the course, together with the use of tools that allow an easy exploration of different task decomposition strategies and their potential parallelism (Tareador) and instrumentation and analysis of task parallel executions on real machines (Extrae and Paraver).This work has been supported by the grant SEV-2015-0493 of the Severo Ochoa Program, awarded by the Spanish Gov- ernment, by the Spanish Ministry of Science and Innovation (contract TIN2015-65316-P) and by Generalitat de Catalunya (contracts 2014-MOOC-00057 and 2014-SGR-1051). We also thank the anonymous reviewers and editor for their comments during the review process, other professors that have been in- volved in the implementation of the course and Paul Carpenter at BSC for his corrections and suggestions to improve the text.Postprint (published version

    Compilació per a supercomputadors

    Get PDF

    TaskPoint: sampled simulation of task-based programs

    Get PDF
    Sampled simulation is a mature technique for reducing simulation time of single-threaded programs, but it is not directly applicable to simulation of multi-threaded architectures. Recent multi-threaded sampling techniques assume that the workload assigned to each thread does not change across multiple executions of a program. This assumption does not hold for dynamically scheduled task-based programming models. Task-based programming models allow the programmer to specify program segments as tasks which are instantiated many times and scheduled dynamically to available threads. Due to system noise and variation in scheduling decisions, two consecutive executions on the same machine typically result in different instruction streams processed by each thread. In this paper, we propose TaskPoint, a sampled simulation technique for dynamically scheduled task-based programs. We leverage task instances as sampling units and simulate only a fraction of all task instances in detail. Between detailed simulation intervals we employ a novel fast-forward mechanism for dynamically scheduled programs. We evaluate the proposed technique on a set of 19 task-based parallel benchmarks and two different architectures. Compared to detailed simulation, TaskPoint accelerates architectural simulation with 64 simulated threads by an average factor of 19.1 at an average error of 1.8% and a maximum error of 15.0%.This work has been supported by the Spanish Government (Severo Ochoa grants SEV2015-0493, SEV-2011-00067), the Spanish Ministry of Science and Innovation (contract TIN2015-65316-P), Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272), the RoMoL ERC Advanced Grant (GA 321253), the European HiPEAC Network of Excellence and the Mont-Blanc project (EU-FP7-610402 and EU-H2020-671697). M. Moreto has been partially supported by the Ministry of Economy and Competitiveness under Juan de la Cierva postdoctoral fellowship JCI-2012-15047. M. Casas is supported by the Ministry of Economy and Knowledge of the Government of Catalonia and the Cofund programme of the Marie Curie Actions of the EUFP7 (contract 2013BP B 00243). T.Grass has been partially supported by the AGAUR of the Generalitat de Catalunya (grant 2013FI B 0058).Peer ReviewedPostprint (author's final draft

    Main memory in HPC: do we need more or could we live with less?

    Get PDF
    This study analyzes the memory capacity requirements of important HPC benchmarks and applications. We find that the High Performance Conjugate Gradients benchmark could be an important success story for 3D-stacked memories in HPC, but High-performance Linpack is likely to be constrained by 3D memory capacity

    Hierarchical clustered register file organization for VLIW processors

    Get PDF
    Technology projections indicate that wire delays will become one of the biggest constraints in future microprocessor designs. To avoid long wire delays and therefore long cycle times, processor cores must be partitioned into components so that most of the communication is done locally. In this paper, we propose a novel register file organization for VLIW cores that combines clustering with a hierarchical register file organization. Functional units are organized in clusters, each one with a local first level register file. The local register files are connected to a global second level register file, which provides access to memory. All intercluster communications are done through the second level register file. This paper also proposes MIRS-HC, a novel modulo scheduling technique that simultaneously performs instruction scheduling, cluster selection, inserts communication operations, performs register allocation and spill insertion for the proposed organization. The results show that although more cycles are required to execute applications, the execution time is reduced due to a shorter cycle time. In addition, the combination of clustering and hierarchy provides a larger design exploration space that trades-off performance and technology requirements.Peer ReviewedPostprint (published version

    Hypernode reduction modulo scheduling

    Get PDF
    Software pipelining is a loop scheduling technique that extracts parallelism from loops by overlapping the execution of several consecutive iterations. Most prior scheduling research has focused on achieving minimum execution time, without regarding register requirements. Most strategies tend to stretch operand lifetimes because they schedule some operations too early or too late. The paper presents a novel strategy that simultaneously schedules some operations late and other operations early, minimizing all the stretchable dependencies and therefore reducing the registers required by the loop. The key of this strategy is a pre-ordering that selects the order in which the operations will be scheduled. The results show that the method described in this paper performs better than other heuristic methods and almost as well as a linear programming method but requiring much less time to produce the schedules.Peer ReviewedPostprint (published version
    corecore